arrow-array 54.0.0

Array abstractions for Apache Arrow
The central type in Apache Arrow is the array: a known-length sequence of values that all have the same type. This crate provides concrete implementations of each type, as well as an [`Array`] trait that can be used for type-erasure.

# Building an Array

Most [`Array`] implementations can be constructed directly from iterators or [`Vec`]:

```
use arrow_array::{Int32Array, ListArray, StringArray};
use arrow_array::types::Int32Type;

Int32Array::from(vec![1, 2]);
Int32Array::from(vec![Some(1), None]);
Int32Array::from_iter([1, 2, 3, 4]);
Int32Array::from_iter([Some(1), Some(2), None, Some(4)]);

StringArray::from(vec!["foo", "bar"]);
StringArray::from(vec![Some("foo"), None]);
StringArray::from_iter([Some("foo"), None]);
StringArray::from_iter_values(["foo", "bar"]);

ListArray::from_iter_primitive::<Int32Type, _, _>([
    Some(vec![Some(1), None, Some(3)]),
    None,
    Some(vec![]),
]);
```

Additionally, [`ArrayBuilder`](builder::ArrayBuilder) implementations can be used to construct arrays with a push-based interface:

```
use arrow_array::Int16Array;

// Create a new builder with a capacity of 100
let mut builder = Int16Array::builder(100);

// Append a single primitive value
builder.append_value(1);
// Append a null value
builder.append_null();
// Append a slice of primitive values
builder.append_slice(&[2, 3, 4]);

// Build the array
let array = builder.finish();

assert_eq!(5, array.len());
assert_eq!(2, array.value(2));
assert_eq!(&array.values()[3..5], &[3, 4]);
```

# Low-level API

Internally, arrays consist of one or more shared memory regions backed by a [`Buffer`], the number and meaning of which depend on the array's data type, as documented in the [Arrow specification].
For example, the type [`Int16Array`] represents an array of 16-bit integers and consists of:

* An optional [`NullBuffer`] identifying any null values
* A contiguous [`ScalarBuffer`] of values

Similarly, the type [`StringArray`] represents an array of UTF-8 strings and consists of:

* An optional [`NullBuffer`] identifying any null values
* An offsets [`OffsetBuffer`] identifying valid UTF-8 sequences within the values buffer
* A values [`Buffer`] of UTF-8 encoded string data

Array constructors such as [`PrimitiveArray::try_new`] provide the ability to cheaply construct an array from these parts, with functions such as [`PrimitiveArray::into_parts`] providing the reverse operation.

```
use arrow_array::{Array, Int32Array, StringArray};
use arrow_buffer::OffsetBuffer;

// Create an Int32Array from a Vec without copying
let array = Int32Array::new(vec![1, 2, 3].into(), None);
assert_eq!(array.values(), &[1, 2, 3]);
assert_eq!(array.null_count(), 0);

// Create a StringArray from parts
let offsets = OffsetBuffer::new(vec![0, 5, 10].into());
let array = StringArray::new(offsets, b"helloworld".into(), None);
let values: Vec<_> = array.iter().map(|x| x.unwrap()).collect();
assert_eq!(values, &["hello", "world"]);
```

Because [`Buffer`] and its derivatives can be created from a [`Vec`] without copying, this is an efficient way both to interoperate with other Rust code and to implement kernels optimised for the Arrow data layout, e.g. by operating on whole buffers instead of individual values.

# Zero-Copy Slicing

Given an [`Array`] of arbitrary length, it is possible to create an owned slice of this data. Internally this only increments some reference counts, so it is very cheap:

```rust
use arrow_array::Int32Array;

let array = Int32Array::from_iter([1, 2, 3]);

// Slice with offset 1 and length 2
let sliced = array.slice(1, 2);
assert_eq!(sliced.values(), &[2, 3]);
```

# Downcasting an Array

Arrays are often passed around as a dynamically typed [`&dyn Array`] or [`ArrayRef`].
For example, [`RecordBatch`](`crate::RecordBatch`) stores columns as [`ArrayRef`].

Whilst these arrays can be passed directly to the [`compute`], [`csv`], [`json`], etc. APIs, it is often the case that you wish to interact with the concrete arrays directly.

This requires downcasting to the concrete type of the array:

```
use arrow_array::{Array, Float32Array, Int32Array};

// Safely downcast an `Array` to an `Int32Array` and compute the sum
// using native i32 values (panics if the array is not an `Int32Array`)
fn sum_int32(array: &dyn Array) -> i32 {
    let integers: &Int32Array = array.as_any().downcast_ref().unwrap();
    integers.iter().map(|val| val.unwrap_or_default()).sum()
}

// Safely downcasts the array to a `Float32Array` and returns a &[f32] view of the data
// Note: the values for positions corresponding to nulls will be arbitrary (but still valid f32)
fn as_f32_slice(array: &dyn Array) -> &[f32] {
    array.as_any().downcast_ref::<Float32Array>().unwrap().values()
}
```

The [`cast::AsArray`] extension trait can make this more ergonomic:

```
use arrow_array::Array;
use arrow_array::cast::AsArray;
use arrow_array::types::Float32Type;

fn as_f32_slice(array: &dyn Array) -> &[f32] {
    array.as_primitive::<Float32Type>().values()
}
```

# Alternatives to ChunkedArray Support

The Rust implementation does not provide the ChunkedArray abstraction implemented by the Python and C++ Arrow implementations. The recommended alternative is to use one of the following:

- `Vec<ArrayRef>`: a simple, eager version of a `ChunkedArray`
- `impl Iterator<Item=ArrayRef>`: a lazy version of a `ChunkedArray`
- `impl Stream<Item=ArrayRef>`: a lazy async version of a `ChunkedArray`

Similar patterns can be applied at the `RecordBatch` level. For example, [DataFusion] makes extensive use of [RecordBatchStream].

This approach integrates well into the Rust ecosystem, simplifies the implementation and encourages the use of performant lazy and async patterns.
```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, Float32Array, RecordBatch, StringArray};
use arrow_array::cast::AsArray;
use arrow_array::types::Float32Type;
use arrow_schema::DataType;

let batches = [
    RecordBatch::try_from_iter(vec![
        ("label", Arc::new(StringArray::from(vec!["A", "B", "C"])) as ArrayRef),
        ("value", Arc::new(Float32Array::from(vec![0.1, 0.2, 0.3])) as ArrayRef),
    ]).unwrap(),
    RecordBatch::try_from_iter(vec![
        ("label", Arc::new(StringArray::from(vec!["D", "E"])) as ArrayRef),
        ("value", Arc::new(Float32Array::from(vec![0.4, 0.5])) as ArrayRef),
    ]).unwrap(),
];

// Collect the string labels of every batch into one Vec
let labels: Vec<&str> = batches
    .iter()
    .flat_map(|batch| batch.column(0).as_string::<i32>())
    .map(Option::unwrap)
    .collect();

// Collect the f32 values of every batch into one Vec
let values: Vec<f32> = batches
    .iter()
    .flat_map(|batch| batch.column(1).as_primitive::<Float32Type>().values())
    .copied()
    .collect();

assert_eq!(labels, ["A", "B", "C", "D", "E"]);
assert_eq!(values, [0.1, 0.2, 0.3, 0.4, 0.5]);
```

[`ScalarBuffer`]: arrow_buffer::ScalarBuffer
[`OffsetBuffer`]: arrow_buffer::OffsetBuffer
[`NullBuffer`]: arrow_buffer::NullBuffer
[`Buffer`]: arrow_buffer::Buffer
[`&dyn Array`]: Array
[Arrow specification]: https://arrow.apache.org/docs/format/Columnar.html
[`compute`]: https://docs.rs/arrow/latest/arrow/compute/index.html
[`json`]: https://docs.rs/arrow/latest/arrow/json/index.html
[`csv`]: https://docs.rs/arrow/latest/arrow/csv/index.html
[DataFusion]: https://github.com/apache/arrow-datafusion
[RecordBatchStream]: https://docs.rs/datafusion/latest/datafusion/execution/trait.RecordBatchStream.html